
AD-A117 460  TEXAS A AND M UNIV COLLEGE STATION INST OF STATISTICS  F/G 12/1
QUANTILES, PARAMETRIC-SELECT DENSITY ESTIMATIONS, AND BI-INFORM--ETC(U)
JUN 82  E PARZEN  DAAG29-80-C-0070
UNCLASSIFIED  TR-B-6  ARO-16992.12-MA  NL



TEXAS  A&M  UNIVERSITY 

COLLEGE  STATION,  TEXAS  77843-3143 


Institute  of  Statistics 

Phone 713 - 845-3141


QUANTILES, PARAMETRIC-SELECT DENSITY ESTIMATIONS,
AND BI-INFORMATION PARAMETER ESTIMATORS


Emanuel  Parzen 

Institute of Statistics, Texas A&M University


Technical  Report  No.  B-6 
June,  1982 


Texas  A&M  Research  Foundation 
Project  No.  4226 

"Robust  Statistical  Data  Analysis  and  Modeling" 


Sponsored  by  the  U.S.  Army  Research  Office 
Grant  DAAG29-80-C-0070 


Professor  Emanuel  Parzen,  Principal  Investigator 


Approved  for  public  release;  distribution  unlimited. 




Unclassified
SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)

REPORT DOCUMENTATION PAGE (READ INSTRUCTIONS BEFORE COMPLETING FORM)

1. REPORT NUMBER: Technical Report B-6
3. RECIPIENT'S CATALOG NUMBER:
4. TITLE (and Subtitle): Quantiles, Parametric-select Density Estimation, and Bi-information Parameter Estimators
5. TYPE OF REPORT & PERIOD COVERED: Technical
6. PERFORMING ORG. REPORT NUMBER:
7. AUTHOR(s): Emanuel Parzen
8. CONTRACT OR GRANT NUMBER(s): DAAG29-80-C-0070
9. PERFORMING ORGANIZATION NAME AND ADDRESS: Texas A&M University, Institute of Statistics, College Station, TX 77843
10. PROGRAM ELEMENT, PROJECT, TASK AREA & WORK UNIT NUMBERS:
11. CONTROLLING OFFICE NAME AND ADDRESS: Army Research Office
12. REPORT DATE: June 1982
13. NUMBER OF PAGES:
14. MONITORING AGENCY NAME & ADDRESS (if different from Controlling Office):
15. SECURITY CLASS. (of this report): Unclassified
15a. DECLASSIFICATION/DOWNGRADING SCHEDULE:
16. DISTRIBUTION STATEMENT (of this Report): Approved for public release; distribution unlimited.
17. DISTRIBUTION STATEMENT (of the abstract entered in Block 20, if different from Report):

18. SUPPLEMENTARY NOTES

The  findings  in  this  report  are  not  to  be  construed  as  an  official 
Department  of  the  Army  position,  unless  so  designated  by  other  authorized 
documents.


19. KEY WORDS (Continue on reverse side if necessary and identify by block number)

Statistical data science, functional inference, information divergence of
index $\alpha$, bi-information, comparison distribution functions, density
estimation.


20. ABSTRACT (Continue on reverse side if necessary and identify by block number)

This paper outlines a quantile-based approach to functional inference
problems in which the parameters to be estimated are density functions.
Exponential models and autoregressive models are approximating densities
which can be justified as maximum entropy for respectively the entropy of a
probability density and the entropy of a quantile density. It is proposed
that bi-information estimation of a density function can be developed by
analogy to the problem of identification of regression models.




CONTENTS 


1.  Statistical  Science,  Data  Analysis,  and  Buffalo  Snowfall 

2.  Functions  that  describe  probability  distributions 

3.  Raw  functions  that  describe  samples 

4.  Smooth  functions  that  describe  samples  and  estimate 
probability  distributions 

5.  Parameter  estimation  and  information  divergence 

6.  Information  and  bi-information  parameter  estimation,  and 
comparison  distribution  functions 

7.  Statistical  inference  reduced  to  density  estimation 

8. Parametric-select density estimation and maximum entropy
densities

9. Exact-parametric and parametric-select estimation of
probability density functions using exponential models

10. Case studies of bi-information density estimation




1.  Statistical  Science,  data  analysis,  and  Buffalo  snowfall 

Statisticians  complain  about  the  failure  of  universities 
to  adequately  educate  students  on  how  to  analyze  statistical 
data.  At  the  same  time  some  statisticians  state  that  data 
analysis  is  an  art,  and  thus  cannot  be  taught.  When  these 
statisticians  speak  of  statistical  science  it  is  difficult  to 
imagine  to  what  they  are  alluding  since  they  seem  to 
sneeringly  reject  all  attempts  to  reason,  and  reach  consensus, 
about  the  evaluation  of  methods  to  be  used  as  part  of  the  process 
of  statistical  data  analysis. 

I  would  like  to  propose  a  data  set  which  I  believe  provides 
a  useful  test  case  for  various  approaches  to  data  analysis, 
namely  the  annual  time  series  of  snowfall  in  Buffalo,  N.Y.  The 
segment  of  that  series  which  I  will  discuss  is  1910-1972, 
although  it  has  many  interesting  features  when  extended  to  1981. 
The data analysis question to be considered is: What probability
distributions can be used to describe Buffalo snowfall? An
ever-present hypothesis to be considered is whether Buffalo
snowfall is normal.

2. Functions that describe probability distributions

The  probability  law  of  a  continuous  random  variable  X  can 
be  described  by  one  or  more  of  the  following  functions: 


(1) Distribution Function $F(x) = \Pr[X \le x]$

(2) Probability Density Function $f(x) = F'(x)$

(3) Quantile Function $Q(u) = F^{-1}(u)$

$$Q(u) = \inf\{x : F(x) \ge u\}$$
$$\phantom{Q(u)} = \inf\{x : F(x) = u\} \quad \text{if } F \text{ is continuous}$$
$$\phantom{Q(u)} = x \text{ such that } F(x) = u \quad \text{if } F \text{ is increasing at } x$$

(4) Quantile-Density Function $q(u) = Q'(u)$

(5) Density-Quantile Function $fQ(u) = f(Q(u))$

Theorem: For F continuous

$$FQ(u) = u, \qquad fQ(u)\, q(u) = 1$$
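The theorem can be checked numerically on a distribution whose five descriptive functions are available in closed form. The sketch below assumes the unit exponential distribution, which is an illustrative choice and not an example taken from the report; the function names are hypothetical.

```python
import math

def F(x):
    """Distribution function of the unit exponential (assumed example)."""
    return 1.0 - math.exp(-x)

def f(x):
    """Probability density function, the derivative of F."""
    return math.exp(-x)

def Q(u):
    """Quantile function, the inverse of F."""
    return -math.log(1.0 - u)

def q(u):
    """Quantile-density function, the derivative of Q."""
    return 1.0 / (1.0 - u)

def fQ(u):
    """Density-quantile function f(Q(u))."""
    return f(Q(u))

grid = [0.1, 0.25, 0.5, 0.75, 0.9]
# Both lists should hold values at the level of floating-point rounding.
identity_errors = [abs(F(Q(u)) - u) for u in grid]
reciprocal_errors = [abs(fQ(u) * q(u) - 1.0) for u in grid]
```

For this distribution $fQ(u) = 1 - u$ and $q(u) = 1/(1-u)$, so the product is identically 1.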


3.  Raw  functions  that  describe  samples 

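Data $X_1, \ldots, X_n$ is called a random sample of X when
$X_1, \ldots, X_n$ are independent random variables identically
distributed as X. An important role in the analysis of a sample
is played by the order statistics $X_{(1)} < X_{(2)} < \cdots < X_{(n)}$.

(1) Sample Distribution $\tilde F(x)$ = fraction of $X_1, \ldots, X_n \le x$

$$\tilde F(x) = \frac{j}{n}, \qquad X_{(j)} \le x < X_{(j+1)}$$

(2) Sample Probability Density, or Histogram, estimates
f(x) by a numerical derivative

$$\tilde f(x) = \frac{\tilde F(x+h) - \tilde F(x-h)}{2h}$$

(3) Sample Quantile $\tilde Q(u) = \tilde F^{-1}(u)$

$$\tilde Q(u) = X_{(j)}, \qquad \frac{j-1}{n} < u \le \frac{j}{n}$$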

A  universal  display  of  any  data  set  is  provided  by  the  quantile 
box  plot  introduced  in  Parzen  (1979). 


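(4) Sample Quantile-Density is a numerical derivative

$$\tilde q(u) = \frac{\tilde Q(u+h) - \tilde Q(u-h)}{2h}$$

(5) Sample Density-Quantile $\tilde{fQ}(u) = 1/\tilde q(u)$.

An important formula is

$$\tilde{fQ}\left(\frac{j}{n+1}\right) = 2\,\{(n+1)(X_{(j+1)} - X_{(j-1)})\}^{-1}$$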
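The raw descriptive functions of a sample are simple to compute from the order statistics. The following is a minimal sketch, assuming distinct observations (a tie would give a zero spacing) and taking the spacing formula at the points $u = j/(n+1)$, $j = 2, \ldots, n-1$; the function names and the data values are hypothetical.

```python
import math

def sample_distribution(xs, x):
    """Fraction of the observations that are less than or equal to x."""
    return sum(1 for v in xs if v <= x) / len(xs)

def sample_quantile(xs, u):
    """X_(j) for (j-1)/n < u <= j/n, with 0 < u <= 1."""
    order = sorted(xs)
    j = max(1, math.ceil(u * len(order)))
    return order[j - 1]

def sample_density_quantile(xs):
    """Pairs (j/(n+1), 2 / ((n+1)(X_(j+1) - X_(j-1)))) for j = 2, ..., n-1."""
    order = sorted(xs)
    n = len(order)
    out = []
    for j in range(2, n):                  # 1-based index j = 2, ..., n-1
        gap = order[j] - order[j - 2]      # X_(j+1) - X_(j-1), assumed nonzero
        out.append((j / (n + 1), 2.0 / ((n + 1) * gap)))
    return out

data = [3.0, 1.0, 4.0, 2.0, 6.0]           # illustrative values, no ties
median = sample_quantile(data, 0.5)
fq_values = sample_density_quantile(data)
```

The reciprocal of each returned density-quantile value is the corresponding raw quantile-density, a centered difference of the sample quantile function with step $1/(n+1)$.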

4.  Smooth  functions  that  describe  samples  and  estimate 
probability  distributions 

The functions F, f, Q, q, fQ that represent the true
probability distribution of a random variable X are estimated by
smooth functions $\hat F$, $\hat f$, $\hat Q$, $\hat q$, $\hat{fQ}$ which are derived from the raw
descriptive functions $\tilde F$, $\tilde f$, $\tilde Q$, $\tilde q$, $\tilde{fQ}$. One distinguishes between
parametric and non-parametric methods of estimating smooth
functions.

A parametric estimation method: (1) assumes a family
$F_\theta$, $f_\theta$, $Q_\theta$, $q_\theta$, $f_\theta Q_\theta$ of functions, called parametric models,
which are indexed by a parameter $\theta = (\theta_1, \ldots, \theta_k)$; (2) forms
estimators $\hat\theta = (\hat\theta_1, \ldots, \hat\theta_k)$ of $\theta$; (3) forms smooth functions by

$$\hat F(x) = F_{\hat\theta}(x), \qquad \hat f(x) = f_{\hat\theta}(x),$$
$$\hat Q(u) = Q_{\hat\theta}(u), \qquad \hat q(u) = q_{\hat\theta}(u),$$
$$\hat{fQ}(u) = f_{\hat\theta} Q_{\hat\theta}(u).$$

A non-parametric estimation method forms estimators which
are not based on parametric models. Important examples of
non-parametric estimators of a probability density f(x) and a
quantile-density q(u) are respectively

$$\hat f(x) = \frac{1}{h} \int_{-\infty}^{\infty} K\left(\frac{x-y}{h}\right) d\tilde F(y)$$

$$\hat q(u) = \frac{1}{h} \int_0^1 K\left(\frac{u-t}{h}\right) d\tilde Q(t)$$

for suitable kernels $K(\cdot)$ and bandwidth h.
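Because the sample distribution function places mass 1/n at each observation, the kernel integral for the density reduces to an average of kernels centered at the data. A minimal sketch, assuming a Gaussian kernel and writing the bandwidth as `h` (both are illustrative choices, not taken from the report):

```python
import math

def gaussian_kernel(t):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def kernel_density(xs, x, h):
    """(1/(n h)) * sum_i K((x - X_i)/h): the kernel integrated against the sample d.f."""
    return sum(gaussian_kernel((x - v) / h) for v in xs) / (len(xs) * h)

data = [1.0, 2.0, 3.0, 4.0, 6.0]           # illustrative values
h = 0.8                                    # assumed bandwidth
step = 0.01
grid = [-10.0 + step * i for i in range(2601)]     # covers -10 to 16
# A Riemann sum of the estimate should be close to 1, as for any density.
total_mass = sum(kernel_density(data, x, h) for x in grid) * step
```

The quantile-density estimator has the same structure, with spacings of the order statistics as the weights instead of 1/n.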

5. Parameter estimation and information divergence

When a parametric model $f_\theta$ is assumed, parameter estimators
$\hat\theta$ are often determined by minimizing a "distance" between $\tilde f(x)$
and $f_\theta(x)$. A "distance" between two probability densities f(x)
and g(x) is denoted I(f;g) and is called an information divergence
between f(x) and g(x). It is usually not symmetric in f and g.
It does not satisfy the triangle inequality for a metric. But
it does satisfy $I(f;g) \ge 0$ and $I(f;g) = 0$ if and only if f = g.

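The most famous, and most important, definition of
information divergence is

$$I_1(f;g) = \int_{-\infty}^{\infty} \left\{-\log \frac{g(x)}{f(x)}\right\} f(x)\, dx$$

called the information divergence of order 1, or Kullback-
Leibler information divergence. Information divergence of
order $\alpha$ is defined for $\alpha > 0$ (but $\alpha \ne 1$) by

$$I_\alpha(f;g) = \frac{1}{\alpha - 1} \log \int_{-\infty}^{\infty} \left\{\frac{g(x)}{f(x)}\right\}^{1-\alpha} f(x)\, dx.$$

The most important values of $\alpha$ are $0.5 \le \alpha \le 2$.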

Bi-information divergence is defined by

$$II(f;g) = \int_{-\infty}^{\infty} \left|\log \frac{g(x)}{f(x)}\right|^2 f(x)\, dx;$$

it may be regarded as related to $I_2(g;f)$.

Information divergence of order 1 has an important decomposition

$$I_1(f;g) = H(f;g) - H(f)$$

defining

$$H(f;g) = \int_{-\infty}^{\infty} \{-\log g(x)\}\, f(x)\, dx,$$

$$H(f) = H(f;f) = \int_{-\infty}^{\infty} \{-\log f(x)\}\, f(x)\, dx.$$

We call H(f;g) the cross-entropy of f and g, and call H(f) the
entropy of f.
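The decomposition of order 1 information divergence into cross-entropy minus entropy can be verified by numerical integration. The sketch below assumes two normal densities, f = Normal(0, 1) and g = Normal(1, 2), chosen only for illustration; the helper `integrate` is a hypothetical midpoint rule.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(func, lo, hi, n=20000):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    step = (hi - lo) / n
    return sum(func(lo + (i + 0.5) * step) for i in range(n)) * step

f = lambda x: normal_pdf(x, 0.0, 1.0)      # assumed example density f
g = lambda x: normal_pdf(x, 1.0, 2.0)      # assumed example density g

# H(f;g), H(f), and I_1(f;g), each integrated against f over a wide range.
cross_entropy = integrate(lambda x: -math.log(g(x)) * f(x), -12.0, 12.0)
entropy = integrate(lambda x: -math.log(f(x)) * f(x), -12.0, 12.0)
divergence = integrate(lambda x: -math.log(g(x) / f(x)) * f(x), -12.0, 12.0)
```

For two normal densities the divergence also has the closed form $\log(\sigma_g/\sigma_f) + \{\sigma_f^2 + (\mu_f - \mu_g)^2\}/(2\sigma_g^2) - 1/2$, which gives an independent check on the numerical value.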

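Maximum likelihood parameter estimation can be shown to
be equivalent to minimum cross-entropy estimation. The
likelihood function of a parametric model $f_\theta$ is defined by

$$L(f_\theta) = \log f_\theta(X_1, \ldots, X_n) = \sum_{j=1}^{n} \log f_\theta(X_j).$$

One may verify that

$$L(f_\theta) = n \int_{-\infty}^{\infty} \log f_\theta(x)\, d\tilde F(x) = -n\, H(\tilde f; f_\theta).$$

The maximum likelihood parameter estimator $\hat\theta$, defined by

$$L(f_{\hat\theta}) = \max_\theta L(f_\theta),$$

clearly satisfies

$$H(\tilde f; f_{\hat\theta}) = \min_\theta H(\tilde f; f_\theta).$$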
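The identity between the log-likelihood and the cross-entropy taken against the sample distribution function can be illustrated directly. A minimal sketch, assuming a normal model and a small made-up data set; the function names are hypothetical.

```python
import math

def log_normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def log_likelihood(xs, mu, sigma):
    """L(f_theta): sum of log f_theta(X_j) over the sample."""
    return sum(log_normal_pdf(x, mu, sigma) for x in xs)

def sample_cross_entropy(xs, mu, sigma):
    """Integral of -log f_theta against the sample d.f., i.e. a sample average."""
    return -sum(log_normal_pdf(x, mu, sigma) for x in xs) / len(xs)

data = [2.1, 3.4, 1.9, 4.2, 2.8, 3.1]      # illustrative values
n = len(data)
# Maximum likelihood estimators for the normal model.
mu_hat = sum(data) / n
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in data) / n)

at_mle = sample_cross_entropy(data, mu_hat, sigma_hat)
shifted = sample_cross_entropy(data, mu_hat + 0.1, sigma_hat)
rescaled = sample_cross_entropy(data, mu_hat, 1.1 * sigma_hat)
```

Any move away from the maximum likelihood values raises the cross-entropy, so `at_mle` is the smallest of the three quantities.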

It also satisfies

$$I_1(\tilde f; f_{\hat\theta}) = \min_\theta I_1(\tilde f; f_\theta).$$

In general parameter estimators $\hat\theta$ are found by minimizing
$I_\alpha(\tilde f; f_\theta)$ or $I_\alpha(f_\theta; \tilde f)$. Chi-squared estimators minimize $I_2(f_\theta; \tilde f)$
while modified chi-squared estimators minimize $I_2(\tilde f; f_\theta)$.

To compute $I_1(\tilde f; f_\theta)$ one needs to compute $H(\tilde f)$. A useful
formula for accomplishing this is

$$H(f) = \int_{-\infty}^{\infty} \{-\log f(x)\}\, dF(x) = \int_0^1 \{-\log fQ(u)\}\, du = \int_0^1 \log q(u)\, du.$$

The value of $I_1(\tilde f; f_{\hat\theta})$ can be used to test the goodness of fit
of the parametric model $f_\theta$.
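The entropy formula in terms of the quantile-density can be checked on a distribution with a closed-form quantile function. The sketch below assumes the unit exponential, for which $-\log f(x) = x$ and $q(u) = 1/(1-u)$; both integrals should be close to 1, the entropy of that distribution. The helper `midpoint` is hypothetical.

```python
import math

def midpoint(func, lo, hi, n=100000):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    step = (hi - lo) / n
    return sum(func(lo + (i + 0.5) * step) for i in range(n)) * step

# Unit exponential (assumed example): f(x) = exp(-x), q(u) = 1/(1-u).
# Integral of {-log f(x)} f(x) dx, truncated at x = 50 where the tail is negligible.
entropy_from_density = midpoint(lambda x: x * math.exp(-x), 0.0, 50.0)
# Integral of log q(u) du over (0, 1); the midpoint rule never evaluates u = 1.
entropy_from_quantile_density = midpoint(lambda u: math.log(1.0 / (1.0 - u)), 0.0, 1.0)
```

The integrand $\log q(u)$ has an integrable logarithmic singularity at u = 1, so the second integral converges more slowly than the first.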

6. Information and bi-information parameter estimation, and
comparison distribution functions

Given a sample with sample probability density function $\tilde f$
and parametric model $f_\theta$, one can form diverse parameter
estimators, denoted $\hat\theta$ and $\check\theta$, corresponding to two choices of
information divergence which we take to be: (1) $I_1(\tilde f; f_\theta)$, and
(2) $I_2(f_\theta; \tilde f)$ or $II(\tilde f; f_\theta)$. We call $\hat\theta$ and $\check\theta$ diverse parameter
estimators. For greater precision we call $\hat\theta$ the (order 1)
information estimator, and $\check\theta$ the bi-information estimator.

When the parametric model $f_\theta$ is exact, the diverse
parameter estimators have equivalent statistical properties;
they are both asymptotically efficient estimators, and are not
significantly different from each other.

When the values of $\hat\theta$ and $\check\theta$ computed from a sample are
significantly different one should suspect that the parametric
model $f_\theta$ does not fit the data. The Shapiro-Wilk statistics
for testing normality and exponentiality can be regarded as
comparing diverse estimators which minimize information of
order 1 and 2 respectively.

One can interpret $\hat\theta$ and $\check\theta$ as parameter values of "best
approximating" models.

One wishes to evaluate $F_{\hat\theta}(x)$ and $F_{\check\theta}(x)$ as smooth estimators
of F(x). For any parameter value $\theta$, define

$$D_\theta(u) = F_\theta(\tilde Q(u))$$

which is the sample quantile function of the transformed
random variables

$$U_1 = F_\theta(X_1), \ldots, U_n = F_\theta(X_n).$$

The true parameter value $\theta$ has the property that $U_1, \ldots, U_n$
are distributed with a uniform [0,1] distribution. Then
parameter estimators $\hat\theta$ and $\check\theta$ are compared by the character of
the closeness to the identity function D(u) = u of $D_{\hat\theta}(u)$ and
$D_{\check\theta}(u)$.

We call $D_\theta(u)$ a comparison distribution function. Its
derivative

$$d_\theta(u) = \{D_\theta(u)\}'$$

plays a basic role and is called a comparison density; formulas
for the comparison density are

$$d_\theta(u) = f_\theta(\tilde Q(u))\, \tilde q(u) = \frac{f_\theta(\tilde Q(u))}{\tilde{fQ}(u)}.$$

An alternative comparison density, introduced in Parzen
(1979), is

$$\tilde d(u) = f_0 Q_0(u)\, \tilde q(u) \div \tilde\sigma_0,$$

$$\tilde\sigma_0 = \int_0^1 f_0 Q_0(u)\, \tilde q(u)\, du,$$

$$\tilde D(u) = \int_0^u \tilde d(t)\, dt$$

where $f_0 Q_0(u)$ is a specified density-quantile function.
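The comparison distribution function for a parameter value can be computed by transforming the data with the model distribution function and sorting. A minimal sketch, assuming an exponential model with a scale parameter and data placed exactly at unit-exponential quantiles (an artificial sample chosen so that the correct scale reproduces the identity); all names are hypothetical.

```python
import math

def exponential_cdf(x, scale):
    """Model distribution function F_theta with theta = scale (assumed example)."""
    return 1.0 - math.exp(-x / scale)

def comparison_distribution(xs, cdf):
    """Pairs (j/(n+1), U_(j)), where U_j = F_theta(X_j), for j = 1, ..., n."""
    us = sorted(cdf(x) for x in xs)
    n = len(us)
    return [((j + 1) / (n + 1), u) for j, u in enumerate(us)]

def max_deviation(points):
    """Largest distance between the comparison curve and the identity D(u) = u."""
    return max(abs(d - u) for u, d in points)

# Exact unit-exponential quantiles at u = j/(n+1); scale 1 therefore fits perfectly.
n = 9
data = [-math.log(1.0 - j / (n + 1)) for j in range(1, n + 1)]
good = comparison_distribution(data, lambda x: exponential_cdf(x, 1.0))
bad = comparison_distribution(data, lambda x: exponential_cdf(x, 3.0))
```

With a real sample the curve for the better parameter value stays closer to the diagonal, and the shape of the departure indicates how the model fails.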

Parameter estimators can be justified as minimizing
information divergence

$$I_1(d_\theta) = \int_0^1 -\log d_\theta(u)\, du = I_1(\tilde f; f_\theta)$$

$$II(d_\theta) = \int_0^1 |\log d_\theta(u)|^2\, du = II(\tilde f; f_\theta)$$

$$I_\alpha(d_\theta) = \frac{1}{\alpha - 1} \log \int_0^1 \{d_\theta(u)\}^{1-\alpha}\, du$$

$$\int_0^1 |d_\theta(u) - 1|^2\, du = \int_0^1 |d_\theta(u)|^2\, du - 1$$

These measure the closeness to 1 of $d_\theta(u)$, or the closeness to
D(u) = u of $D_\theta(u)$. However the final decision about parameter
estimators should be based on visual inspection of the graph of
$D_\theta(u)$.


Another consequence of considering information of order
$\alpha$ is that we can unify the estimation criterion used to form
maximum likelihood estimators with the estimation criterion
used to form Gaussian time series parameter estimators:

$$I_{sp}(\tilde f; f_\theta) = \log \int_0^1 \frac{\tilde f(\omega)}{f_\theta(\omega)}\, d\omega,$$

where $\tilde f$ and $f_\theta$ are spectral densities. It is comparable to

$$I_2(d_\theta) = \log \int_0^1 \frac{\tilde{fQ}(u)}{f_\theta(\tilde Q(u))}\, du.$$


7. Statistical inference reduced to density estimation

The quantile approach to statistical data analysis being
developed by Parzen [since Parzen (1979)] is based on the
proposition that conventional problems of statistical inference
concerning (1) a random sample $X_1, \ldots, X_n$, (2) a bivariate
sample $(X_1, Y_1), \ldots, (X_n, Y_n)$, or (3) two samples $X_1, \ldots, X_m$ and
$Y_1, \ldots, Y_n$ should be transformed to problems of functional
inference, estimating and testing hypotheses about density
functions $d(u)$, $d(u_1, u_2)$, $d(u_1, \ldots, u_k)$ on the unit interval
$0 < u < 1$, unit square $0 < u_1, u_2 < 1$, unit hypercube $0 < u_1, \ldots, u_k < 1$. To
illustrate how this is done consider the following problems.

Modeling Bivariate Data and Tests for Independence. Let
X and Y be continuous random variables with joint density
function $f_{X,Y}(x,y)$. The hypothesis, $H_0$: X and Y are independent,
can be expressed

$$H_0: f_{X,Y}(x,y) = f_X(x)\, f_Y(y)$$

or in terms of information divergence

$$I_1(f_{X,Y}; f_X f_Y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \left\{-\log \frac{f_X(x) f_Y(y)}{f_{X,Y}(x,y)}\right\} f_{X,Y}(x,y)\, dx\, dy$$

by

$$H_0: I_1(f_{X,Y}; f_X f_Y) = 0.$$


Define

$$D(u_1, u_2) = F_{X,Y}(Q_X(u_1), Q_Y(u_2))$$

$$d(u_1, u_2) = \frac{\partial^2}{\partial u_1\, \partial u_2} D(u_1, u_2) = \frac{f_{X,Y}(Q_X(u_1), Q_Y(u_2))}{f_X Q_X(u_1)\, f_Y Q_Y(u_2)}$$

We call $d(u_1, u_2)$ the quantile dependence density.

The hypothesis $H_0$ can be expressed

$$H_0: D(u_1, u_2) = u_1 u_2, \qquad d(u_1, u_2) = 1.$$

One can verify that

$$I_1(f_{X,Y}; f_X f_Y) = \int_0^1 \int_0^1 \{\log d(u_1, u_2)\}\, d(u_1, u_2)\, du_1\, du_2 = -H_1(d(u_1, u_2))$$

Thus estimating the information divergence between $f_{X,Y}$ and
$f_X f_Y$ is equivalent to estimating the negative of the entropy of
$d(u_1, u_2)$.

Estimators $\hat d_m(u_1, u_2)$ dependent on a finite number of parameters
can be formed from the raw estimator

$$\tilde D(u_1, u_2) = \tilde F_{X,Y}(\tilde Q_X(u_1), \tilde Q_Y(u_2)).$$

Modeling likelihood ratios and testing equality of
distributions. Let X and Y be continuous random variables.
The hypothesis

$$H_0: F_X(x) = F_Y(x), \quad \text{or} \quad f_X(x) = f_Y(x)$$

can be expressed in terms of information divergence

$$I(f_Y; f_X) = \int_{-\infty}^{\infty} -\log \frac{f_X(y)}{f_Y(y)}\, dF_Y(y) = \int_0^1 -\log d(u)\, du = -H_{qd}(d(u))$$

defining the comparison distribution function and comparison
density function

$$D(u) = F_X Q_Y(u), \qquad d(u) = \frac{d}{du} D(u) = \frac{f_X(Q_Y(u))}{f_Y(Q_Y(u))}$$

Estimating the information divergence between $f_Y$ and $f_X$ is
equivalent to estimating the negative of the entropy in the
quantile-density sense of the comparison density d(u).

8. Parametric-select density estimation and maximum entropy
densities

A density d(u) = D'(u) can be approximated in many ways
by sequences $d_m(u)$, m = 1, 2, ... of functions which converge to
d(u). For m = 1, 2, ..., let $\hat d_m(u)$ be an estimator of $d_m(u)$; the
sequence $\hat d_m(u)$ then estimates d(u).

If $d_m(u)$ corresponds to a standard finite parametric
model for d(u) for which one could consider testing the hypothesis
that $d_m(u)$ provides an exact model, we call $d_m(u)$ a parametric-select
representation, and $\hat d_m(u)$ a parametric-select estimator,
to indicate that we are free to select the number of parameters
in $d_m(u)$ to provide an adequate approximation or representation
of d(u).

We call $d_m(u)$ a non-parametric representation, and $\hat d_m(u)$
a non-parametric estimator, if $d_m(u)$ does not correspond to a
standard finite parameter model which could be interpreted as
an exact model.

An important criterion for developing the functional form
of exact models for densities is the maximum entropy principle.

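A density f(x), $-\infty < x < \infty$, which maximizes entropy
$H(f) = \int_{-\infty}^{\infty} \{-\log f(x)\}\, f(x)\, dx$ subject to constraints

$$\int_{-\infty}^{\infty} T_j(x)\, f(x)\, dx = \tau_j, \qquad j = 1, \ldots, k,$$

where $T_j(x)$ are specified functions (called sufficient statistics)
and $\tau_j$ are specified moments can be shown to have the representation,
called an exponential model,

$$\log f(x) = \sum_{j=1}^{k} \theta_j T_j(x) - \Psi(\theta_1, \ldots, \theta_k)$$

where

$$\Psi(\theta_1, \ldots, \theta_k) = \log \int_{-\infty}^{\infty} \exp\left\{\sum_{j=1}^{k} \theta_j T_j(x)\right\} dx$$

guarantees that f(x) integrates to 1.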
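The maximum entropy property can be illustrated for the sufficient statistics x and x squared, where the exponential model is the normal density. The sketch below compares the entropy of three densities with mean 0 and variance 1; the Laplace and uniform comparison densities and the helper names are illustrative assumptions, not taken from the report.

```python
import math

def midpoint(func, lo, hi, n=100000):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    step = (hi - lo) / n
    return sum(func(lo + (i + 0.5) * step) for i in range(n)) * step

def entropy(density, lo, hi):
    """H(f): integral of {-log f(x)} f(x) dx over the stated range."""
    return midpoint(lambda x: -math.log(density(x)) * density(x), lo, hi)

def normal(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

LAPLACE_SCALE = 1.0 / math.sqrt(2.0)       # variance 2 b^2 = 1
def laplace(x):
    return math.exp(-abs(x) / LAPLACE_SCALE) / (2.0 * LAPLACE_SCALE)

HALF_WIDTH = math.sqrt(3.0)                # variance (2a)^2 / 12 = 1
def uniform(x):
    return 1.0 / (2.0 * HALF_WIDTH)

h_normal = entropy(normal, -12.0, 12.0)
h_laplace = entropy(laplace, -30.0, 30.0)
h_uniform = entropy(uniform, -HALF_WIDTH, HALF_WIDTH)
```

All three densities satisfy the same two moment constraints, and the normal density has the largest entropy of the three.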

A quantile-density function q(u), $0 < u < 1$, which maximizes entropy
$H_{qd}(q) = \int_0^1 \log q(u)\, du$ subject to the constraints

$$\frac{\int_0^1 \exp(2\pi i u v)\, f_0 Q_0(u)\, q(u)\, du}{\int_0^1 f_0 Q_0(u)\, q(u)\, du} = \rho(v), \qquad v = 0, \pm 1, \ldots, \pm m,$$

where $f_0 Q_0(u)$ is a specified density-quantile function, must have
the representation, called an autoregressive model,

$$q(u) = q_0(u)\, \left|1 + \alpha_m(1)\, e^{2\pi i u} + \cdots + \alpha_m(m)\, e^{2\pi i u m}\right|^{-2}$$

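9. Exact-parametric and parametric-select estimation of
probability density functions using exponential models

Two important exponential models for a density f(x),
$-\infty < x < \infty$, are the normal density and the gamma density.

The normal density, denoted Normal($\mu$, $\sigma$),

$$f_{\mu,\sigma}(x) = \frac{1}{\sigma}\, \phi\left(\frac{x - \mu}{\sigma}\right),$$

$$\phi(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} x^2\right)$$

is exponential with sufficient statistics $T_1(x) = x$ and
$T_2(x) = x^2$.

The Gamma density, denoted Gamma(r, $\lambda$) where $\lambda = 1/\sigma$,

$$f_{r,\sigma}(x) = \frac{1}{\sigma}\, f_r\left(\frac{x}{\sigma}\right),$$

$$f_r(x) = \frac{1}{\Gamma(r)}\, x^{r-1} e^{-x}, \quad x > 0,$$
$$\phantom{f_r(x)} = 0, \quad x < 0,$$

is exponential with sufficient statistics $T_1(x) = x$ and $T_2(x)
= \log x$.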
A location scale parameter Gamma density

$$f_{r,\mu,\sigma}(x) = \frac{1}{\sigma}\, f_r\left(\frac{x - \mu}{\sigma}\right)$$

is not an exponential model. We can treat it as one by
estimating $\mu$ (say, by the minimum of the random sample
$X_1, \ldots, X_n$), and treating $X_j - \mu$ as a sample from $f_{r,\sigma}(x)$.

The hypothesis that the data is fit by a normal distribution
versus the hypothesis that the data is fit by a Gamma
distribution can be tested by forming an over-parametrized
exponential model with sufficient statistics

$$T_1(x) = x, \quad T_2(x) = x^2, \quad T_3(x) = x^3, \quad T_4(x) = \log x.$$

The (order 1) information divergence, or maximum likelihood,
estimators $\hat\theta_1, \hat\theta_2, \hat\theta_3, \hat\theta_4$, which minimize information divergence
of order 1, $\int_0^1 -\log d_\theta(u)\, du$, may be found for an exponential
model by solving

$$\bar T_j = E_{\hat\theta}[T_j], \qquad j = 1, \ldots, 4,$$

where $\bar T_j = \frac{1}{n} \sum_{i=1}^{n} T_j(X_i)$ and $E_\theta[T_j]$ is estimated by [...]

The bi-information divergence estimators $\check\theta_1, \check\theta_2, \check\theta_3, \check\theta_4$,
which minimize information divergence $\int_0^1 |\log d_\theta(u)|^2\, du$, may
be found using least squares regression analysis techniques by
minimizing with respect to $\theta_1, \ldots, \theta_k$ the sum of squares

$$\sum_{j=2}^{n-1} \left| \log \tilde f(X_{(j)}) - \theta_1 T_1(X_{(j)}) - \cdots - \theta_k T_k(X_{(j)}) + \Psi \right|^2$$

Stepwise regression is used to suggest parsimonious parametrizations.
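The least squares step can be sketched as an ordinary regression of log-density values on the sufficient statistics, with the intercept playing the role of $-\Psi$. The code below is a minimal sketch under stated assumptions: it solves the normal equations by Gaussian elimination, and it is checked on exact Gamma(r = 3) log-densities instead of raw sample estimates so that the known coefficients can be recovered. All function names are hypothetical.

```python
import math

def solve(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def least_squares(design, ys):
    """Coefficients minimizing the sum of squared residuals, via normal equations."""
    k = len(design[0])
    ata = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    aty = [sum(row[i] * y for row, y in zip(design, ys)) for i in range(k)]
    return solve(ata, aty)

def bi_information_fit(xs, log_density, stats):
    """Regress log-density values on sufficient statistics; returns (intercept, thetas)."""
    design = [[1.0] + [t(x) for t in stats] for x in xs]
    beta = least_squares(design, log_density)
    return beta[0], beta[1:]

# Exact Gamma(r = 3, scale 1): log f(x) = 2 log x - x - log Gamma(3).
xs = [0.5 + 0.5 * i for i in range(12)]
log_f = [2.0 * math.log(x) - x - math.lgamma(3.0) for x in xs]
intercept, theta = bi_information_fit(xs, log_f, [lambda x: x, math.log])
```

In an application the values passed as `log_density` would be logarithms of a raw density estimate at the order statistics, and the fitted coefficients would then be the bi-information estimates. Normal equations are adequate for a handful of statistics; a QR-based solver would be the more stable choice for larger designs.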

Graphical procedures to determine which parameter values
fit best are as follows: estimate $D_\theta(j/(n+1))$, j = 2, ..., n-1, by
adding

$$d_\theta\left(\frac{j}{n+1}\right) = f_\theta(X_{(j)}) \div \tilde f(X_{(j)})$$

and normalizing the sum to go from 0 to 1. One inspects its
graph to see how it deviates from D(u) = u.
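The graphical procedure amounts to a cumulative sum of density ratios rescaled to end at 1. A minimal sketch, assuming the model density and the raw density have already been evaluated at the same order statistics; the function name and the numbers are hypothetical.

```python
def comparison_curve(model_density, raw_density):
    """Cumulative sums of d_j = f_theta(X_(j)) / f~(X_(j)), rescaled to end at 1."""
    ratios = [m / r for m, r in zip(model_density, raw_density)]
    total = sum(ratios)
    curve, running = [], 0.0
    for d in ratios:
        running += d
        curve.append(running / total)
    return curve

# When the model matches the raw density the curve rises in equal steps.
flat = comparison_curve([2.0, 2.0, 2.0, 2.0], [2.0, 2.0, 2.0, 2.0])
# A model that puts too much density at the upper end bends the curve below the diagonal.
tilted = comparison_curve([1.0, 3.0], [1.0, 1.0])
```

Plotting the curve against equally spaced values of u shows where the fitted model over- or under-states the density relative to the data.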

10. Case studies of bi-information density estimation

The density estimators corresponding to the bi-information
parameter estimates of the normal, gamma, and four-parameter
exponential models are presented for four simulated random
samples:

1) Exponential or Gamma ($r = 1$, $\sigma = 1$)

2) Gamma ($r = 10$, $\sigma = 1$)

3) Normal ($\mu = 0$, $\sigma = 1$)

4) Contaminated normal: 100 N(0,1), 5 N(10,1)

In addition density estimators, using bi-information
parameters, are presented for the data set of Buffalo snowfall.
Bi-information select regression estimation of the parameters
of a 4-parameter exponential model with sufficient statistics
$x$, $x^2$, $x^3$, and $\log x$ leads to the conclusion that Buffalo
snowfall obeys a Gamma distribution. It is equally well fit
by a normal distribution whose parameters are estimated by
minimizing bi-information rather than order 1 information.

The  hypothesis  that  Buffalo  snowfall  is  normal  seems  to  be 
acceptable,  but  one  can  question  whether  the  maximum 
likelihood  estimators  (sample  mean  and  variance)  provide  the 
best-fitting  normal  distribution  for  Buffalo  snowfall. 

As in Parzen (1979), we reject a trimodal shape probability
density estimate for Buffalo snowfall, which has been found by
several non-parametric density estimation techniques,
including Tapia and Thompson (1978).